Social Network Team Assignment

GROUP 12: Alicja Wilk | Nijanth Anand | Qinyu Xia | Wai Phyo | Yiting Fu

First, let's import the necessary packages and load in the two data sets we'll use for the entirety of this assignment (products and copurchase).

In [51]:
library(car)
library(dplyr)
library(tidyr)
library(igraph)
library(ggplot2)
library(corrplot)
In [2]:
copurchase <- read.csv("copurchase.csv", stringsAsFactors = FALSE)
products <- read.csv("products.csv", stringsAsFactors = FALSE)

Products contains information about each individual product with the following descriptors:

  • Id: Product id (number 1, 2, ..., 262109)
  • title: Name/title of the product
  • group: Product group (Book, DVD, Video, Music, etc.)
  • salesrank: Amazon Salesrank
  • review_cnt: Total number of product reviews available on Amazon
  • downloads: Total number of downloads
  • rating: User rating on a scale of 1 to 5
In [3]:
head(products)
id  title                                                               group  salesrank  review_cnt  downloads  rating
1   Patterns of Preaching: A Sermon Sampler                             Book      396585           2          2     5.0
2   Candlemas: Feast of Flames                                          Book      168596          12         12     4.5
3   World War II Allied Fighter Planes Trading Cards                    Book     1270652           1          1     5.0
4   Life Application Bible Commentary: 1 and 2 Timothy and Titus        Book      631289           1          1     4.0
5   Prayers That Avail Much for Business: Executive                     Book      455160           0          0     0.0
6   How the Other Half Lives: Studies Among the Tenements of New York   Book      188784          17         17     4.0

Copurchase contains the co-purchase information where a "source" product sale also leads to a "target" product sale.

  • Source: This is the focal product that was purchased
  • Target: People who bought “Source” also purchased the Target product
In [4]:
head(copurchase)
Source  Target
1 2
1 4
1 5
1 15
2 11
2 12

First, we will define the boundary of the co-purchase network to only products that are books and have a salesrank value between 0 and 150,000 inclusive.

In [5]:
books.products <- filter(products, group == "Book" 
                         & salesrank <= 150000 & salesrank >= 0)
books.copurchase <- filter(copurchase, Source %in% books.products$id
                           & Target %in% books.products$id)

We will then find the indegree of each product, i.e. how many edges point to each product in the co-purchase network.

In [6]:
indegree.df <- summarize(group_by(books.copurchase, Target), indegree = n())
names(indegree.df)[1]<-"id"
head(indegree.df)
id  indegree
12         5
33        53
39         4
74         1
77         3
78        11

Likewise, for the outdegree of each product.

In [7]:
outdegree.df <- summarize(group_by(books.copurchase, Source), outdegree = n())
names(outdegree.df)[1]<-"id"
head(outdegree.df)
id   outdegree
12           1
74           1
77           1
79           1
117          1
120          1

Now we merge the two created dataframes (indegree.df and outdegree.df) back into our original products dataframe while creating a new column "degree" which is the sum of indegree and outdegree.

In [8]:
books.graph <- merge(books.products, indegree.df, by="id", all.x = TRUE)
books.graph <- merge(books.graph, outdegree.df, by="id", all.x = TRUE)
books.graph$indegree[is.na(books.graph$indegree)] <- 0
books.graph$outdegree[is.na(books.graph$outdegree)] <- 0
books.graph <- mutate(books.graph, degree = indegree + outdegree)
head(books.graph)
id  title  group  salesrank  review_cnt  downloads  rating  indegree  outdegree  degree
12  Fantastic Food with Splenda : 160 Great Recipes for Meals Low in Sugar, Carbohydrates, Fat, and Calories  Book  24741  12  12  4.5  5  1  6
33 Double Jeopardy (T*Witches, 6) Book 97166 4 4 5.0 53 0 53
39 Night of Many Dreams : A Novel Book 57186 22 22 3.5 4 0 4
45 Beginning ASP.NET Databases using C# Book 48408 4 4 4.0 0 0 0
74 Service Delivery (It Infrastructure Library Series) Book 27507 2 2 4.0 1 1 2
77 Water Touching Stone Book 27012 11 11 4.5 3 1 4

We then pick the product with the highest degree in order to find its subcomponent, i.e. all the products that are connected to this focal product.

However, there is a tie between two products (33 and 4429), both having a degree of 53. We chose to work with 4429 as our focal product.

In [9]:
filter(books.graph, degree == max(books.graph$degree))
id    title                                      group  salesrank  review_cnt  downloads  rating  indegree  outdegree  degree
33    Double Jeopardy (T*Witches, 6)             Book       97166           4          4     5.0        53          0      53
4429  Harley-Davidson Panheads, 1948-1965/M418   Book      147799           3          3     4.5        52          1      53

Next, we construct a subgraph using a subcomponent with product 4429 as its root node.

The resulting subgraph has 904 vertices and 1173 edges.

In [10]:
g <- graph_from_data_frame(books.copurchase, directed = TRUE)
sg <- induced_subgraph(g, subcomponent(g, "4429", "all"), impl = "auto")
sg <- simplify(sg, remove.multiple = F, remove.loops = T)
V(sg)
E(sg)
+ 904/904 vertices, named, from 76cf6cc:
  [1] 77     130    148    187    193    224    321    322    422    556   
 [11] 577    626    724    1051   1644   1817   1822   1851   1971   2071  
 [21] 2210   2279   2285   2326   2330   2332   2343   2423   2470   2501  
 [31] 2505   2558   2572   2657   2658   2806   2807   2959   3032   3119  
 [41] 3191   3217   3306   3427   3588   3670   3737   3861   3909   4002  
 [51] 4014   4038   4068   4099   4140   4174   4184   4185   4222   4223  
 [61] 4345   4429   4977   4993   4994   5018   5059   5163   5164   5293  
 [71] 5355   5388   5623   5638   5639   5655   5670   5821   5851   5875  
 [81] 6012   6014   6058   6059   6392   6411   6445   6546   6711   6713  
 [91] 6807   6808   6817   6942   7196   7198   7222   7233   7325   7376  
+ ... omitted several vertices
+ 1173/1173 edges from 76cf6cc (vertex names):
 [1] 77  ->422  130 ->78   148 ->302  187 ->321  187 ->322  187 ->78  
 [7] 193 ->224  224 ->193  224 ->33   321 ->187  321 ->322  321 ->78  
[13] 322 ->187  322 ->321  322 ->78   422 ->77   422 ->1644 556 ->78  
[19] 577 ->33   626 ->33   724 ->302  1051->302  1644->422  1644->5293
[25] 1817->976  1822->193  1822->724  1851->78   1971->193  2071->3155
[31] 2210->2279 2210->2285 2279->2210 2279->2326 2285->2330 2326->193 
[37] 2326->2210 2330->2343 2330->2345 2332->4140 2343->2285 2343->2330
[43] 2423->5410 2470->556  2501->3588 2505->2501 2558->33   2572->4184
[49] 2572->4185 2657->2658 2658->77   2806->2807 2807->302  2959->1673
[55] 3032->2558 3119->976  3191->2279 3217->4319 3306->2071 3306->4345
+ ... omitted several edges

Its diameter (longest geodesic) traverses 10 vertices, from vertex 37895 to vertex 6676.

In [11]:
diameter <- get_diameter(sg)
diameter
+ 10/904 vertices, named, from 76cf6cc:
 [1] 37895 27936 21584 10889 11080 14111 4429  2501  3588  6676 
In [12]:
V(sg)$color <- ifelse(V(sg)$name %in% diameter$name, "red", "lightblue")
V(sg)["4429"]$color <- "green"
V(sg)["33"]$color <- "gold"
E(sg)$color <- "darkgray"
E(sg,path=diameter)$color <- "red"
E(sg)$width <- 1
E(sg,path=diameter)$width <- 3
options(repr.plot.width = 100, repr.plot.height = 100)
plot(sg, layout=layout_with_fr, vertex.size=1, vertex.label=NA, edge.arrow.size=0.05)

Insights from the subgraph visualizations:

  • There seem to be two main clusters (green = 4429 and gold = 33)
  • The diameter (path is colored red) traverses through our root node 4429

Next, we can look at the cumulative degree distribution of our subcomponent.

In [13]:
degree.distr <- degree_distribution(sg, cumulative = TRUE, mode = "all")
options(repr.plot.width = 10, repr.plot.height = 10)
p <- ggplot() + 
    geom_line(aes(x = c(1:54), y = 1 - degree.distr), color = "darkgray") + 
    geom_point(aes(x = c(1:54), y = 1 - degree.distr), color = "darkorange") + 
    labs(title ="Cumulative Degree Distribution of Subcomponent", x = "Degree", y = "Cumulative Frequency") + 
    scale_x_discrete(limits = c(1:54)) +
    theme_minimal()
print(p)

We see that a sizable share of vertices have degree 3 or more (cumulative frequency ≤ 0.375 around degree 3). Moreover, the cumulative frequency approaches 1 as degree reaches roughly 20.

As for the edge density of our subgraph, i.e. the proportion of present edges from all possible edges in the network, the value equates to:

In [14]:
edge_density(sg)
0.00143695057772028

This means that our subgraph is quite sparse; ideally, in terms of product co-purchases, Amazon would want a highly interconnected network to drive growth and engagement.
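As a quick sanity check, the same value can be derived by hand: for a directed graph without loops, density is |E| / (|V| · (|V| − 1)). A minimal sketch using the counts reported above:

```r
# Edge density by hand for a directed graph (no loops):
# density = |E| / (|V| * (|V| - 1))
n.v <- vcount(sg)                # 904 vertices
n.e <- ecount(sg)                # 1173 edges
n.e / (n.v * (n.v - 1))          # 1173 / (904 * 903) ≈ 0.00143695
```

which matches the output of edge_density(sg).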

Next, we can calculate network level centralization indices for our subgraph:

In [15]:
cat("Network level degree centralization index: ", centr_degree(sg)$centralization)
Network level degree centralization index:  0.02794058

The network-level degree centralization index is calculated using:

$C_d = \frac{\sum_{i=1}^{n} [c_d^* - c_d(i)]}{\max \sum_{i=1}^{n} [c_d^* - c_d(i)]}$

Where $c_d^* = \max(c_d(1), c_d(2), \ldots, c_d(n))$

  • 0 ≤ Cd ≤ 1;
  • Cd = 0 when all nodes have the same degree centrality
  • Cd = 1 if one node has maximal degree centrality and all others have minimal.

Hence, our subgraph has nodes that share a similar degree centrality.
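To build intuition for the index, here is a minimal sketch on two extreme toy graphs: a star (one hub tied to every other node) should yield Cd = 1, while a ring (every node with identical degree) should yield Cd = 0.

```r
# Degree centralization at the two extremes (undirected toy graphs):
star <- make_star(10, mode = "undirected")  # one hub connected to 9 leaves
ring <- make_ring(10)                       # every vertex has degree 2
centr_degree(star)$centralization           # 1: maximally centralized
centr_degree(ring)$centralization           # 0: all degrees equal
```

Our value of ~0.028 sits very close to the ring end of this spectrum.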

In [16]:
cat("Network level closeness centralization index: ", centr_clo(sg, mode = "all")$centralization)
Network level closeness centralization index:  0.1074443

The network-level closeness centralization index is calculated using:

$C_c = \frac{\sum_{i=1}^{n} [c_c^* - c_c(i)]}{\max \sum_{i=1}^{n} [c_c^* - c_c(i)]}$

Where $c_c^* = \max(c_c(1), c_c(2), \ldots, c_c(n))$

  • 0 ≤ Cc ≤ 1;
  • Cc = 0 when all nodes have the same closeness (each node is, on average, equally close to the others).
  • Cc = 1 if one node has the largest possible closeness and all others have minimal closeness.

Hence, our subgraph has nodes that share a similar closeness score.

In [17]:
cat("Network level betweenness centralization index: ", centr_betw(sg, directed = TRUE)$centralization)
Network level betweenness centralization index:  0.0003616307

The network-level betweenness centralization index is calculated using:

$C_b = \frac{\sum_{i=1}^{n} [c_b^* - c_b(i)]}{\max \sum_{i=1}^{n} [c_b^* - c_b(i)]}$

Where $c_b^* = \max(c_b(1), c_b(2), \ldots, c_b(n))$

  • 0 ≤ Cb ≤ 1;
  • Cb = 0 when all nodes have the same betweenness (no single node dominates as a bridge, broker, or gatekeeper)
  • Cb = 1 if one node has the largest possible betweenness and all others have minimal betweenness.

Hence, our subgraph has nodes that share a similar betweenness score.

In [25]:
cat("Network level eigenvector centralization index: ", centr_eigen(sg, directed = TRUE)$centralization)
Network level eigenvector centralization index:  0.9964208

The network-level eigenvector centralization index builds on eigenvector centrality, which scores each vertex in proportion to the sum of the centralities of its neighbors.

i.e. a central actor is one connected to other central actors (influence).

Hence, with an index close to 1, influence in our subgraph is highly concentrated: a few nodes are connected to many nodes that themselves have high centrality scores.

Now, we can attach the vertex-level centrality scores (from the indices above) to each product, along with its hub and authority scores.

In [26]:
sub.books <- filter(books.graph, id %in% V(sg)$name)
cd <- data.frame("id" = V(sg)$name, "centr_clo" = centr_clo(sg, mode = "all")$res, 
                 "centr_betw" = centr_betw(sg, directed = TRUE)$res, "centr_eigen" = centr_eigen(sg, directed = TRUE)$vector, 
                 "hub_score" = hub_score(sg, weights=NA)$vector, "auth_score" = authority_score(sg, weights=NA)$vector)
sub.books <- merge(sub.books, cd, by = "id")

Next, we calculate each product's neighbor scores (the mean rating, mean salesrank, and mean review count of its in-neighbors) and add them as columns.

In [28]:
sub.books$nghb_mn_rating <- 0
sub.books$nghb_mn_salesrank <- 0
sub.books$nghb_mn_review_cnt <- 0
for(i in 1:nrow(sub.books)) 
{
    curr.id <- as.character(sub.books[i,]$id)
    nbors <- filter(sub.books, id %in% neighbors(sg, curr.id, mode = c("in"))$name)
    sub.books[i, "nghb_mn_rating"] <- mean(nbors$rating)
    sub.books[i, "nghb_mn_salesrank"] <- mean(nbors$salesrank)
    sub.books[i, "nghb_mn_review_cnt"] <- mean(nbors$review_cnt)
}
In [31]:
select(sub.books, id, salesrank, review_cnt, rating, degree, 
       centr_clo, centr_betw, centr_eigen, hub_score, auth_score, 
       nghb_mn_rating, nghb_mn_salesrank, nghb_mn_review_cnt) %>% head()
id   salesrank  review_cnt  rating  degree  centr_clo   centr_betw  centr_eigen   hub_score     auth_score    nghb_mn_rating  nghb_mn_salesrank  nghb_mn_review_cnt
33       97166           4     5.0      53  0.14559819           0  1.337641e-16  0.000000e+00  1.000000e+00        4.103774           82153.26           21.075472
77       27012          11     4.5       4  0.08168250          12  2.586353e-17  2.239872e-16  4.449831e-17        4.666667           41744.00            4.000000
78      140480           3     4.5      11  0.10761530           0  1.152630e-15  1.140518e-17  5.753636e-04        4.500000           73179.00          157.818182
130      29460         375     3.5       2  0.09733750           1  2.010698e-18  5.531568e-04  2.473186e-17        4.500000           19415.00            6.000000
148      77008          42     4.0       2  0.09119370           2  6.233478e-18  3.592652e-05  2.567663e-17        0.000000           46701.00            0.000000
187      17104           4     5.0       6  0.09723269           2  7.359310e-16  5.989914e-04  2.431071e-05        4.500000          133546.67            3.666667

Before we actually fit our Poisson regression, it's good practice to examine the correlations between our predictors.

Here, we'll notice a couple of interesting linear relationships:

  • Degree and Betweenness
  • Degree and Authority Score
  • Neighbor mean rating and Neighbor sales rank
In [37]:
sb.model <- select(sub.books, salesrank, review_cnt, rating, degree, 
                   centr_clo, centr_betw, centr_eigen, hub_score, auth_score, 
                   nghb_mn_rating, nghb_mn_salesrank, nghb_mn_review_cnt)
sb.model[is.na(sb.model)] <- 0
In [38]:
M <- cor(sb.model)
corrplot(M, method = "circle")

As a first attempt at our Poisson regression model, we include all of our predictors as-is.

In [39]:
summary(p.model.1 <- glm(salesrank ~ review_cnt + rating + degree + 
                         centr_clo + centr_betw + centr_eigen + hub_score + auth_score + 
                         nghb_mn_rating + nghb_mn_salesrank + nghb_mn_review_cnt, 
                         family="poisson", data = sb.model))
Call:
glm(formula = salesrank ~ review_cnt + rating + degree + centr_clo + 
    centr_betw + centr_eigen + hub_score + auth_score + nghb_mn_rating + 
    nghb_mn_salesrank + nghb_mn_review_cnt, family = "poisson", 
    data = sb.model)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-355.38  -154.27   -13.65   128.00   480.21  

Coefficients:
                     Estimate Std. Error  z value Pr(>|z|)    
(Intercept)         1.136e+01  6.881e-04 16514.09   <2e-16 ***
review_cnt         -3.658e-03  3.495e-06 -1046.44   <2e-16 ***
rating             -2.333e-02  8.293e-05  -281.33   <2e-16 ***
degree              1.262e-02  6.664e-05   189.36   <2e-16 ***
centr_clo          -1.245e-01  6.370e-03   -19.55   <2e-16 ***
centr_betw         -1.478e-03  1.129e-05  -130.86   <2e-16 ***
centr_eigen        -1.388e+00  3.513e-03  -395.24   <2e-16 ***
hub_score           1.250e-01  5.729e-04   218.28   <2e-16 ***
auth_score         -3.578e-01  4.671e-03   -76.59   <2e-16 ***
nghb_mn_rating     -1.646e-02  8.220e-05  -200.29   <2e-16 ***
nghb_mn_salesrank  -6.335e-08  3.747e-09   -16.91   <2e-16 ***
nghb_mn_review_cnt  6.979e-04  1.953e-06   357.34   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 28309713  on 903  degrees of freedom
Residual deviance: 25854671  on 892  degrees of freedom
AIC: 25866173

Number of Fisher Scoring iterations: 5

All the predictors appear to be statistically significant, including the intercept.
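Because the Poisson model uses a log link, each coefficient acts multiplicatively on the expected salesrank once exponentiated. A small sketch of this interpretation (the numbers in the comments come from the coefficient table above):

```r
# With a log link, exp(beta) is the multiplicative effect of a one-unit
# increase in the predictor on the expected salesrank:
exp(coef(p.model.1)["degree"])  # exp(0.01262) ≈ 1.013, i.e. ~+1.3% per unit of degree
exp(coef(p.model.1)["rating"])  # exp(-0.02333) ≈ 0.977, i.e. ~-2.3% per rating point
```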

What about its Model Evaluation Metrics?

In [40]:
cat("Median Absolute Deviation", mad(residuals(p.model.1)))  # mad() is the scaled median absolute deviation, not the mean absolute error
Median Absolute Deviation 210.0388
In [41]:
cat("Mean Squared Error", mean(residuals(p.model.1)^2))
Mean Squared Error 28600.3
In [42]:
cat("Root Mean Squared Error", sqrt(mean(residuals(p.model.1)^2)))
Root Mean Squared Error 169.1162

One could be satisfied with the model fitted above, given its relatively low error metrics.

But, before we decide whether this regression model is a good fit, we will need to conduct residual analysis.

In [43]:
options(warn=-1)
par(mfrow = c(2, 2))
plot(p.model.1)

Above, we are mainly concerned with two of the plots: Residuals vs Fitted and Scale-Location.

Ideally, we want our residuals to:

  • Be pretty symmetrically distributed, tending to cluster towards the middle of the plot
  • Be clustered around the lower single digits of the y-axis
  • Not have any clear pattern (random)

However, our residual plots appear skewed and unbalanced.

Hence, we have a couple alternatives to improve our regression model:

  • Remove any irrelevant predictors in case of multicollinearity (we want a parsimonious model)
  • Transform existing predictors by taking their logs (skewed residuals suggest that our predictors may be highly skewed as well)

To detect multicollinearity, we can calculate each predictor's Variance Inflation Factor (VIF).

Ideally, we want the VIFs of our predictors to be less than 5. A VIF above 10 indicates that two or more predictors are likely correlated with each other.
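For reference, the VIF of a single predictor can be sketched by hand: regress that predictor on all the others and take 1 / (1 − R²). (Note that car::vif on a glm works from the coefficient covariance matrix, so this OLS-based sketch may differ slightly from the values reported below.)

```r
# VIF by hand for one predictor: 1 / (1 - R^2) from an auxiliary regression
aux <- lm(degree ~ review_cnt + rating + centr_clo + centr_betw + centr_eigen +
          hub_score + auth_score + nghb_mn_rating + nghb_mn_salesrank +
          nghb_mn_review_cnt, data = sb.model)
1 / (1 - summary(aux)$r.squared)  # should be close to vif()'s value for degree
```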

In [44]:
vif(p.model.1)
review_cnt          1.01376827586218
rating              1.01624100413157
degree              3.38248618639868
centr_clo           1.28836569078437
centr_betw          2.05323161234919
centr_eigen         1.04496552452874
hub_score           1.24875576382603
auth_score          2.08736727022356
nghb_mn_rating      2.06846774492615
nghb_mn_salesrank   1.94348783215824
nghb_mn_review_cnt  1.07941037252909

All of our predictors' VIFs are less than 5, meaning we do not need to remove any; there is no multicollinearity issue in our model.

Next, we visualize the histograms of our predictors (to see if they are skewed).

In [45]:
ggplot(gather(sb.model), aes(value)) + 
    geom_histogram(bins = 50) + 
    facet_wrap(~key, scales = 'free_x')

We see that all of the predictors except rating and closeness centrality are skewed.

We can now revise our Poisson regression model by taking logs of all the predictors except rating and closeness centrality.

In [46]:
summary(p.model.2 <- glm(salesrank ~ log(review_cnt + 1) + rating + log(degree + 1) + centr_clo + log(centr_betw + 1) + 
                         log(centr_eigen + 1) + log(hub_score + 1) + log(auth_score + 1) + log(nghb_mn_rating + 1) + 
                         log(nghb_mn_salesrank + 1) + log(nghb_mn_review_cnt + 1), family="poisson", data = sb.model))
Call:
glm(formula = salesrank ~ log(review_cnt + 1) + rating + log(degree + 
    1) + centr_clo + log(centr_betw + 1) + log(centr_eigen + 
    1) + log(hub_score + 1) + log(auth_score + 1) + log(nghb_mn_rating + 
    1) + log(nghb_mn_salesrank + 1) + log(nghb_mn_review_cnt + 
    1), family = "poisson", data = sb.model)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-368.97  -159.05   -19.72   127.55   389.75  

Coefficients:
                              Estimate Std. Error  z value Pr(>|z|)    
(Intercept)                  1.142e+01  7.765e-04 14712.55   <2e-16 ***
log(review_cnt + 1)         -1.534e-01  1.084e-04 -1415.14   <2e-16 ***
rating                       2.595e-02  9.014e-05   287.85   <2e-16 ***
log(degree + 1)              1.550e-02  4.550e-04    34.06   <2e-16 ***
centr_clo                   -2.496e-01  6.398e-03   -39.01   <2e-16 ***
log(centr_betw + 1)          6.396e-03  1.698e-04    37.66   <2e-16 ***
log(centr_eigen + 1)        -1.647e+00  4.512e-03  -365.14   <2e-16 ***
log(hub_score + 1)           1.653e-01  8.184e-04   202.00   <2e-16 ***
log(auth_score + 1)          2.257e-01  5.078e-03    44.45   <2e-16 ***
log(nghb_mn_rating + 1)     -8.139e-02  3.941e-04  -206.54   <2e-16 ***
log(nghb_mn_salesrank + 1)  -4.407e-03  5.530e-05   -79.68   <2e-16 ***
log(nghb_mn_review_cnt + 1)  4.826e-02  1.362e-04   354.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 28309713  on 903  degrees of freedom
Residual deviance: 25511903  on 892  degrees of freedom
AIC: 25523406

Number of Fisher Scoring iterations: 5
In [47]:
cat("Median Absolute Deviation", mad(residuals(p.model.2)))  # mad() is the scaled median absolute deviation, not the mean absolute error
Median Absolute Deviation 211.3972
In [48]:
cat("Mean Squared Error", mean(residuals(p.model.2)^2))
Mean Squared Error 28221.13
In [49]:
cat("Root Mean Squared Error", sqrt(mean(residuals(p.model.2)^2)))
Root Mean Squared Error 167.9915

We see that the revised model has a slightly lower Root Mean Squared Error, albeit its other error metrics are similar to the first model's.

Also, all the predictors are statistically significant, along with the intercept.

Moving on to the residual analysis...

In [50]:
par(mfrow = c(2, 2))
plot(p.model.2)

Now, though not completely perfect, we can see that the residual plots are closer to what is desired (randomly distributed).

Hence, the revised Poisson model is the better fit and is preferred over the first model.